-
Notifications
You must be signed in to change notification settings - Fork 262
Add support for direct store in epilogue and padding support for wave transfer without transpose #3465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
32a7c74 to
461a40d
Compare
kabrahamAMD
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, good work
| index_t, | ||
| index_t) | ||
| { | ||
| // Notes: padding is currently not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update comment (or remove, as it provides no additional information to the error message)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
461a40d to
9134cbb
Compare
|
|
||
| // Limitations of the current implementation: | ||
| // - no multiAB | ||
| // - GemmSpecialization Default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like GemmSpec default is no longer a hard requirement for using WaveTransfer, but only when A B layouts are not row column respectively? If so update comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| // We need to investigate if it makes sense to remove cshuffle for smaller types | ||
| // Currently we use direct store for NRepeat equal to 4 or 8. For 16 bit type we use at | ||
| // lease buffer store 64 bit for 16 contiguous threads -> 128 bytes in toral (full cache line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typos: lease, toral
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
LGTM, had some small notes and I have a few questions:
|
Discussed offline. Basically benchmarks have been performed and wavetransfer did seem to be a decent bit better for 2D. No significant improvement for 3D so no instances added. We want to look into more automated benchmarks later. |
64462d1 to
7d685e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for direct store in epilogue and padding support for wave transfer without transpose, enabling performance improvements for small K problems and grouped convolution operations on GFX12 hardware.
Changes:
- Added direct store epilogue option that bypasses cshuffle for better performance on small K problems
- Implemented padding support for wave transfer without transpose operations
- Added wave transfer with interleaved layout to support direct store functionality
- Enabled new wave transfer instances for grouped convolution forward operations (F16 and BF16)
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
device_grouped_conv2d_fwd_wmma_cshufflev3_wave_transfer_nhwgc_gkyxc_nhwgk_f16_instance.cpp |
New F16 instance definitions for wave transfer grouped convolution |
device_grouped_conv2d_fwd_wmma_cshufflev3_wave_transfer_nhwgc_gkyxc_nhwgk_bf16_instance.cpp |
New BF16 instance definitions for wave transfer grouped convolution |
device_grouped_conv_fwd_wmma_cshufflev3_wave_transfer_instance.hpp |
Template instantiations for wave transfer instances with various tile sizes |
gridwise_ab_transfer_wave_tiles_interleave.hpp |
New interleaved wave tile transfer implementation for direct store support |
epilogue_direct_store.hpp |
New direct store epilogue that bypasses cshuffle for improved performance |
gridwise_gemm_wmma_cshuffle_v3_common.hpp |
Core logic for conditional epilogue type selection and wave transfer applicability checks |
gridwise_gemm_wmma_cshuffle_v3.hpp |
Added IsFusedKernel template parameter to control direct store usage |
gridwise_ab_transfer_wave_tiles.hpp |
Added padding support via PadGridDescriptor helper method |
epilogue_cshuffle_v3_wmma_base.hpp |
Added IsLDSNeeded() method for LDS usage detection |
device_grouped_conv_fwd_multiple_abd_wmma_cshuffle_v3.hpp |
Major refactoring to pass M_K and N_K descriptors, enable runtime descriptor transformation, and force MN padding |
device_grouped_gemm_wmma_splitk_cshuffle_v3.hpp |
Updated to use conditional epilogue type selection |
device_batched_gemm_wmma_cshuffle_v3.hpp |
Updated to use conditional epilogue type selection |
device_batched_gemm_wmma_cshuffle_v3_b_scale.hpp |
Updated to use conditional epilogue type selection |
device_gemm_reduce_wmma_cshuffle_v3.hpp |
Added IsFusedKernel=true to disable direct store for fused kernels |
device_gemm_multiple_d_layernorm_wmma_cshuffle_v3.hpp |
Added IsFusedKernel=true to disable direct store for fused kernels |
device_gemm_bias_add_reduce_wmma_cshuffle_v3.hpp |
Added IsFusedKernel=true to disable direct store for fused kernels |
CMakeLists.txt |
Added new source files to build configuration |
grouped_convolution_forward_wmma_cshufflev3.inc |
Added function declarations for new wave transfer instances |
grouped_convolution_forward.hpp |
Added instance registration calls for new wave transfer operations |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Need to add template parameter. It will be removed during refactoring
7d685e7 to
ad8995e
Compare
Proposed changes
Summary:
Checklist
Please put an
xinto the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.clang-formaton all changed filesDiscussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered